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We show that Kolmogorov's broken stick model describes oscillations in the coverage signal ob- 
tained from high-throughput transcriptomic experiments. We consider improved predictions of gene 
ends in poorly annotated organisms using results from this theory. 



INTRODUCTION 

All the common Next Generation Sequencing (NGS) 
platforms currently available on the market proceed by 
sequencing in parallel many short DNA molecules, typi- 
cally 25-500 nt in length Those short molecules are 
obtained by fragmentation of DNA/RNA molecules using 
either physical or biochemical means like nebulisation, 
sonication or random enzymatic digestion |2J. 

On the SOLiD platform, in the case of single reads se- 
quencing, fragmentation and size selection are performed 
targeting fragments significantly longer than the read 
length [3 . Consequently, only the first nucleotides of 
each fragments are read by the machine leading to a sys- 
tematic truncation of the fragments 3'end. This effect is 
often ignored based on the assumption that the bias av- 
erages out for a large number of fragments coming from 
multiple identical molecules. 

Nevertheless, we report here that this truncation 
causes observable effects in RNA sequencing (RNA-seq) 
data close to transcripts 3 '-ends that are in agreement 
with a simple theory based on Kolmogorov's Broken Stick 
model (4]. 



THEORY 

An unbiased RNA fragmentation can be seen as a se- 
quential random process that, for N fragments at a step 
n creates A^ + l fragments at step n+1 by selecting an ex- 
isting fragment independently of its length and breaking 
it in two not necessarily equal pieces. Such a process is a 
particular case of Kolmogorov's Broken Stick model, and 
leads to fragments with lengths following a log-normal 
distribution [H [5] with probability distribution function 
given by 

, n 1 (lnl-m) 2 



-fGH- "-"Mb)' (2) 

where Lo is the average and a 1 the variance of the frag- 
ments length distribution. 

We set up a Monte Carlo scheme that creates frag- 
ments with such a distribution out of the last M nu- 
cleotides (nt) of numerous identical copies of an RNA 
molecule and builds Ck(Lo, a), the coverage - a signal in- 
dicating the number of reads mapped at each position of 
the genome - k base pairs (bp) ahead of the transcript 
3'end that one expects to obtain provided that only the 
first TV base pairs of each fragment are sequenced, i.e. 

c k (L ,a) = A M*i~*), fc=l,2,...M, (3) 

where Xj corresponds to the distance from the first nu- 
cleotide of a fragment i to the 3'end on the original RNA 
molecule, A is an arbitrary constant and 1n(%) is an indi- 
cator function equal to one on [0, N] and zero elsewhere. 
Each fragment i is taken from a set Il ,<t obtained by 
generating fragments with lengths distributed according 
to equation ([!} and excluding the ones shorter than the 
read length N. 

Provided that the parameters Lq and a are known for 
a given experiment and that the existence of a 3'end is 
known within a genomic region D of a reasonable size, 
its location z can be obtained by maximising the over- 
lap between the predicted pattern and the experimental 
coverage, i.e. by solving 

M 

argmaxy^Cfc(£ ,(j) c k . z , (4) 
zeD k=o 

where Ck. z denotes the coverage obtained from the exper- 
iments k base pairs ahead of a genomic position z. 

EXPERIMENTAL DATA 



where L is the fragment length and the parameters m 
and s are given by 



We analysed a total of 6 datasets from RNA-seq exper- 
iments performed on Enteroccocus Faecalis strain v583. 
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Four of those originate from a single RNA extraction 
sequenced on SOLiD v3 using the plain single stranded 
RNA sequencing protocol from the platform manufac- 
turer. In experiments labeled S3srlA and S3srlB, all 
RNA was sequenced while in S3srOA and S3srOB the ribo- 
somal RNA (rRNA) was removed beforehand using Am- 
bion MICRO BExpress Bacterial mRNA Enrichment Kit. 
The different suffixes A or B correspond to independent 
RNA-seq performed on the same RNA sample separated 
after fragmentation, just before the step ligating sequenc- 
ing adapters, thus testing the reproducibility of the last 
steps of the library preparation and the sequencing itself. 

The two datasets labelled S55srl and S55rrl corre- 
spond to single strand RNA-seq on SOLiD 5500 from 
two RNA extractions obtained from bacteria grown in 
different growth conditions. Ahead of manufacturer's 
protocol, short RNA oligos were ligated to the 5 'ends 
as described in [5]. 

An important aspect of the protocol is that, in ev- 
ery case, RNA (or DNA) molecules are first fragmented 
and size selected targeting a length greater than 100 nt 
[3]- Those fragments are then amplified and attached to 
beads in an emulsion PCR before being sequenced. The 
sequencing itself is performed on the 50 first nucleotides 

- or more precisely on the first 50 nucleotides transi- 
tions, as obtained with SOLiD 2-base color encoding [7 

— of each fragment, leaving the remaining fraction of the 
molecule unread. 

In every case, the sequencing generated 60 to 100 mil- 
lion reads, each with a length corresponding to 50 color 
encoded nucleotides transitions. Reads were aligned to 
the genome using Bowtie version 0.12.7 [5] with default 
options but allowing for multiple matches. Due the 
specificities of the aligner, reads are converted to 48 
bp long fragments during the alignment, thus reducing 
the practical read length to this number. Between 40 
and 60% of reads were mappable, which corresponds 
to an average coverage over 450x of the 3.2Mbp of the 
bacterial genome. After alignment, we calculate the 
coverage and take multiply mapped reads into account 
by dividing their contribution to the read count by the 
number of matches. 



In order to confirm that RNA fragmentation results 
into the postulated log-normal distribution, we measured 
the fragment length distribution for another RNA sample 
prepared for future RNA-seq on an Agilent 2100 Bioana- 
lyzer (FIG. [I]). The match with a log-normal distribution 
is clear except for a peak corresponding to fragments of 
13 nt, which corresponds to synthetic RNAs artificially 
added to this particular sample. The sample used here 
was prepared using a different protocol and cannot be 
quantitatively compared to the samples used for the se- 
quencing experiments previously described. 
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Figure 1: Length distribution of an RNA sample measured 
with an Agilent 2100 Bioanalyzer after fragmentation and li- 
brary preparation adding adapters with total constant length 
of 128 nt. The superimposed plot shows a log-normal distri- 
bution with mean 65 and standard deviation 60 shifted by 
128 bp to the right. The peak observed in the measurement 
at 141 bp corresponds to an excess of synthetic short RNA 
sequences artificially added to the particular sample used. 



TRANSCRIPTS 3'ENDS 

In bacteria, the rho-independent or intrinsic termina- 
tor is one of the two major mechanisms for transcrip- 
tion termination HO] and can be quickly and reli- 
able predicted using the computational tool Transter- 
mHP [TT]. We predicted those terminators on the chro- 
mosome of E. Faecalis v583 (NCBI reference AE0 16830) 
using TranstermHP v2.08 with default options. Out of 
a total of 1851 predictions, only the 1229 terminators 
found with 100% confidence were retained. 

It can be easily seen that the signal falls to zero well be- 
fore the location of the terminator and has a pronounced 
peak around position -80 nt. Ahead of the peak, one can 
notice 2 to 3 periods of a weak oscillation with a period 
on the order of lOObp, with no plausible biological origin. 

Ahead of each selected terminator, we collected the 
read coverage obtained from RNA-seq on a window of 
400 nt starting from the last nucleotide of the termina- 
tor stem-loop as predicted by TranstermHP. It is worth 
noting that real 3'end of the corresponding transcript is 



Position relative to last nucleotide of stem-loop (nt) 



Figure 2: Normalized coverage signal obtained from the 
S3srlB dataset ahead of rho-independent terminator averaged 
over 1229 rho-independent terminators predicted with 100% 
confidence on E. faecalis' chromosome using TranstermHP. 
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S3srOA La = 110 , a = 30.2, d = 6 err = 5.47% 




S3srOB Ln = 109.6 , a = 30.9, d = 7 err = 4.7% 
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Figure 3: Comparison between the signals obtained as described in section Transcripts 3'ends from the 6 RNA-seq experiments 
and the ones predicted by the theory after fitting the parameters Lq, o and d. 



located downstream of this location, usually after a 4 to 
9 nt long poly-uracil (poly-U) tail [9JQ2]. The signal was 
normalised so that its integral over the region of interest 
is unitary. Averaging over all the 1229 terminators, for 
the case of the S3srlB dataset, lead to the signal pre- 
sented in FIG. |2 



RESULTS AND DISCUSSION 

We use equation (|3]) from the previously described 
Monte Carlo scheme with a read length N — 48 nt to 
generate the coverage expected from our theory. In ev- 



ery cases, the scheme creates random fragments from 10 5 
copies of the original whole molecule, which appears to 
be sufficient for a relative accuracy beyond 10 -2 . Addi- 
tionally, we introduce a parameter d that corresponds to 
a positional shift representing the average distance be- 
tween the last nucleotide of the stem-loop in the rho- 
independent terminator as predicted by TranstermHP 
and the real 3 'end of the corresponding molecule, i.e. 
the length of the poly-U tail following the stem-loop. We 
fit d and the mean Lq and the standard deviation a of 
the underlying log-normal distribution. We output the 
coverage at each position over 350 nt ahead of the last 
nucleotide of the stem-loop. We measure the error e be- 
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Nb. in [-50,50] 


Nb. in [4,9] 


Mean 


Std. 


Nai've approach 


745 


32 


24.83 


13.41 


Pattern overlap 


744 


67 


22.80 


16.08 



Table I: Summary of the data presented in FIG. [4] showing 
for both methods the number of predictions within certain 
ranges and the mean and standard deviation of the absolute 
distance between a prediction and the expected location of the 
3'end, assumed to be 7 nt downstream the last nucleotide in 
the stem-loop of the rho- independent terminator, as obtained 
from the fit in FIG. [jjjb) 

tween the theoretical coverage c obtained using the fitted 
parameters and c , the one obtained from the data, as 

350 / 350 

e = $> J+d -Q) 2 / ^Tcf . (5) 

i=l I i=l 

Fitting the two datasets from the S3sr0 type (FIG.[3|a) 
and (b)) results in an average fragment length of 110 nt 
and an estimated poly-U tail length d of 6 nt, which is 
in good agreement with what is commonly expected for 
this tail jjJJ[T5]. On the other hand, the datasets of the 
S3srl series (FIG. gc) and (d)) result in a shorter esti- 
mated average fragment length and a poly-U tail length 
of 17 nt, which seems unrealistically high. As there is no 
reason for the two couples of experiments to differ in the 
length of the poly-U tail (the two samples were prepared 
using the same protocol from the same RNA extractions 
and differ only by the removal of rRNA), we attribute 
this difference to an issue with data fitting and note that 
underestimating Lq causes an overestimate of d. 

Furthermore, we note that the results for the A' and 
'B' samples of S3sr0 and S3srl are very similar to each 
other in every aspect. Since the members of each pair 
were separated after the RNA fragmentation step, this 
indicates a good reproducibility of the library prepara- 
tion and sequencing, and implies that the observed ef- 
fects depend only on the steps preceding and including 
the fragmentation. 

Finally, the datasets of the S55 series lead to much 
worse fits (FIG.|3jd) and (e)) where only the right-most 
part of the signal corresponding to the fall before the 
terminator can be reproduced. Overall, the signal 
obtained from the data seems much more noisy and the 
pattern is less pronounced than previously. This effect 
is likely related to the different protocol used in those 
experiments: the inclusion of tags at the beginning of 
RNA molecules modifies their lengths in a non controlled 
way, causing a blurring of the pattern. Nevertheless, the 
parameters obtained from the fit are realistic and in line 
with those of the other four datasets. 

We now focus on the dataset S3srlB that gives the 
lowest relative error and use our theory to predict indi- 
vidual 3 'ends near the 1229 predicted rho-independent 
terminators. We use as initial guess the location of the 
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Figure 4: Distribution of distances between the last nucleotide 
in the stem-loop of the rho-independent terminator and pre- 
dictions of 3'end of the corresponding transcript obtained ei- 
ther by a nai've approach searching for the closest locus where 
the signal goes to zero (top) or using the maximum pattern 
overlap method from equation Q (bottom) for the 1229 ter- 
minators predicted by TranstermHP with 100% confidence. 
The vertical dashed lines delimits the region between 4 and 
9 nt downstream the terminator stem-loop where the actual 
3'end is expected. 



last nucleotide of the stem-loop and search within a re- 
gion of 100 bp centered around this guess. We compare 
on FIG. [4] and TABLE [I] the predictions obtained by op- 
timising equation Q to a naive approach that detects 
the locus closest to the initial guess where the coverage 
goes to zero. We notice that the number of predictions 
within a range of 100 bp around the initial guess is sim- 
ilar for both methods, while the number of predictions 
falling between 4 and 9 bp downstream of the last nu- 
cleotide of the terminator stem-loop (the location where 
the real 3'end of the transcript is expected) is more than 
twice higher for the maximum pattern overlap method. 
We also notice that the average absolute distance to the 
expected 3'end is slightly improved by using our theory, 
but its standard deviation is worse. Finally, we point out 
that the naive approach predicts many 3 'ends far down- 
stream the terminator (+30 bp and and further) which 
are not biologically relevant. This effect is not present 
while using predictions based on our theory. 

The reason for having only such relatively modest 
improvements is that, while we have shown that the 
proposed theory works well for the average signal ahead 
of gene ends, the pattern is not directly observed when 
considering the read coverage of one particular tran- 
script. Often, the signal shows one or several box-shaped 
regions of high signal, with a position and periodicity 
only very roughly in agreement with the theory (FIG. 
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(c) Region ahead of terminator at 595281 - 595333 

Figure 5: Examples of the read coverage ahead of predicted 
terminators showing patterns very different from the one pre- 
dicted by the theory. 



[5ja) and (b) ). In other situations, the pattern may 
be completely absent (FIG. |5jc), in particular the blue 
curves). Those effects are likely related to sequence- 
dependent biases in the fragmentation process that can 
induce preferential cutting sites in multiple copies of 
the same transcripts, thus invalidating the assumptions 
of random fragmentation of our theory. While consid- 
ering many transcripts with different sequences, such 
biases average out, leading to the previously observed 
agreement. The second observed effect may be related 
to RNA degradation from 3'end in the cell: at the 
moment of RNA extraction, multiple copies of the same 
transcripts that have been engaged in degradation are 
likely to have different length and thus their 3'end at 
different but neighbouring genomic locations. Such a 
mechanism will blur the pattern described above and, 
by shortening RNA molecules, will shift the predictions 
upstreams. Due to those shortcomings, a more detailed 
model taking fragmentation biases and potential RNA 
degradation into account will be necessary to provide 
reliable predictions for the location of the 3'end of a 
transcript. 

A direct consequence of the mechanisms described in 
this work is that SOLiD single strand RNA-seq cannot be 
used for accurate determination of 3' end of transcripts. 
The potential truncation of transcripts ends can cause 
problems while comparing results from RNA-seq to other 
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Figure 6: Visualisation of the read coverage from the S3srOA 
dataset in a region containing a ncRNA detected using mi- 
croarray [15) . The region of the ncRNA is marked with verti- 
cal bars. While both methods are in perfect agreement for the 
5'end, a naive interpretation of the coverage from RNA-seq 
would place the 3'end of the transcript at the position indi- 
cated by the dashed line, well before the end of the transcript 
as measured using microarray. 



measurement methods like Northern blot, 3'RACE or mi- 
croarray (example in FIG. ^ ). Furthermore, RNA tran- 
scripts longer than the read length (here 50 nt) but sig- 
nificantly shorter than the targeted average length (here 
100-120 nt) may simply be totally absent from the final 
results while one would naively expect them to be well 
visible. 



CONCLUSIONS 

We demonstrated that a simple theory based on Kol- 
mogorov's broken stick model explains very well the type 
of signal observed on average near transcripts 3 'ends in 
high throughput sequencing experiments. 

The use of this theory for prediction of the 3'end of in- 
dividual transcripts has shown only minor improvements 
compared to a naive approach. The main reason for this 
is the presence of clear preferential cleavage sites on tran- 
scripts that break the assumption of unbiased fragmen- 
tation, essential in our theory. 

Most importantly, we have demonstrated that drawing 
conclusions about transcripts 3 'ends from single stranded 
RNA-seq experiments on SOLiD must be done with great 
care. A good understanding of all the mechanisms in 
the whole RNA-seq pipeline is necessary to give correct 
interpretation to experimental results. 
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